(Not so) Short tutorial on Zarr

This tutorial is derived heavily from the following sources:

  1. Overview of Zarr - https://youtu.be/KiiKvXzhyMs
  2. Interactive Zarr tutorial - https://youtu.be/unGL07trSjA

Please give them a watch for a more detailed understanding.

Learning Goals

By the end of this tutorial, you should be able to

  • Identify the fundamental data structures in Zarr (Groups and Arrays) and the key properties of Arrays (shape, dtype, chunks, attributes)
  • Create Arrays and Groups in local files or in S3
  • Create and edit attributes (metadata)
  • Read and write data into Arrays
  • Evaluate the tradeoffs of different Array chunking strategies
  • Read and write NetCDF-style data to Zarr using Xarray
  • Do parallel processing on Xarray / Zarr data using Dask (time permitting)

Background

  • Arrays are containers that hold items of the same data type and size.
  • The number of items along each dimension is described by the shape.

What is Zarr?

  • Zarr is an open-source specification for storing chunked, compressed, n-dimensional arrays.
  • Developed by Alistair Miles in 2015
  • The intention was to work with large datasets while supporting parallel reads and writes.
  • Development of Zarr is fiscally sponsored by NumFOCUS. The project is also funded by CZI under EOSS.

To learn more, follow them on social media:

Examples

Source: https://twitter.com/LLC4320Bot/status/1389008447941775360?s=20
Source: https://twitter.com/notjustmoore/status/1256232842755014656?s=20

How does Zarr work?

Divides array into chunks then compresses each chunk

Retrieve required chunks

Hierarchical organization

Working

credits: Trevor Manz
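
The chunk-then-compress idea above can be sketched in a few lines of NumPy. This is a toy illustration, not how the Zarr library is implemented: the `write_chunked` and `read_chunk` helpers and the dict-based store are hypothetical, `zlib` stands in for a real Zarr compressor, and the array shape is assumed to be an exact multiple of the chunk shape.

```python
import zlib
import numpy as np

def write_chunked(arr, chunks):
    """Split a 2-D array into chunks and compress each chunk independently.

    Returns a dict (a stand-in for a store) mapping keys like "1.0" to bytes.
    Assumes arr.shape is an exact multiple of chunks.
    """
    store = {}
    for i in range(0, arr.shape[0], chunks[0]):
        for j in range(0, arr.shape[1], chunks[1]):
            block = np.ascontiguousarray(arr[i:i + chunks[0], j:j + chunks[1]])
            key = f"{i // chunks[0]}.{j // chunks[1]}"
            store[key] = zlib.compress(block.tobytes())
    return store

def read_chunk(store, key, chunks, dtype):
    """Fetch and decompress a single chunk without touching the others."""
    raw = zlib.decompress(store[key])
    return np.frombuffer(raw, dtype=dtype).reshape(chunks)

data = np.arange(24, dtype="f4").reshape(4, 6)
store = write_chunked(data, chunks=(2, 3))

# Only the chunk covering rows 2-3, columns 0-2 is decompressed here.
block = read_chunk(store, "1.0", chunks=(2, 3), dtype="f4")
```

Because every chunk is compressed on its own, a reader only pays for the chunks that overlap the requested region.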

NumPy Review

In [2]:
import numpy as np

# Creating an array
x = np.zeros(shape=(10,10), dtype='f4')
x.shape, x.dtype
Out[2]:
((10, 10), dtype('float32'))
In [3]:
# How much memory does the array use?
x.nbytes
Out[3]:
400
In [4]:
# Accessing data in the array
x[:5, :5]
Out[4]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]], dtype=float32)
In [5]:
y = np.ones(shape=(20,30), dtype='f4')
y[:3,:3]
Out[5]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]], dtype=float32)
In [6]:
y[:10,:10] = x
y[:3, :3]
Out[6]:
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)
In [7]:
y[-3:, -3:]
Out[7]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]], dtype=float32)

Zarr Basics

Fundamental properties of Zarr arrays:

  • Shape
  • Data type
  • Chunks
  • Attributes
  • Filters (not covered here)
  • Compressors (not covered here)
In [8]:
import zarr
z = zarr.create(shape=(60, 80), dtype='f4', chunks=(10,10), store='test.zarr')
z
Out[8]:
<zarr.core.Array (60, 80) float32>
In [9]:
z.info
Out[9]:
Type               : zarr.core.Array
Data type          : float32
Shape              : (60, 80)
Chunk shape        : (10, 10)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 19200 (18.8K)
No. bytes stored   : 337
Storage ratio      : 57.0
Chunks initialized : 0/48
In [10]:
z.fill_value
Out[10]:
0.0
In [11]:
z[10, 12]
Out[11]:
0.0
In [12]:
z[:] = 42
In [13]:
z.info
Out[13]:
Type               : zarr.core.Array
Data type          : float32
Shape              : (60, 80)
Chunk shape        : (10, 10)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 19200 (18.8K)
No. bytes stored   : 2497 (2.4K)
Storage ratio      : 7.7
Chunks initialized : 48/48
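
The numbers reported by `z.info` can be checked by hand. The short calculation below is only a sanity check of the arithmetic; the stored byte counts (337 and 2497) are copied from the outputs above, not computed.

```python
import math

shape = (60, 80)
chunks = (10, 10)
itemsize = 4  # bytes per float32 value

# Number of chunks: ceil(60 / 10) * ceil(80 / 10)
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Uncompressed size of the whole array and of one chunk
nbytes = math.prod(shape) * itemsize
bytes_per_chunk = math.prod(chunks) * itemsize

# Storage ratio = uncompressed bytes / bytes actually stored
ratio_before_write = round(nbytes / 337, 1)
ratio_after_write = round(nbytes / 2497, 1)

print(n_chunks, nbytes, bytes_per_chunk)      # 48 19200 400
print(ratio_before_write, ratio_after_write)  # 57.0 7.7
```

Before the write, no chunk had been initialized (0/48), so the 337 stored bytes cannot be chunk data; the ratio only becomes meaningful once chunks exist.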

Zarr Attributes

We can attach arbitrary metadata to an array by making use of attributes

In [14]:
z.attrs['units'] = 'degC'
dict(z.attrs)
Out[14]:
{'units': 'degC'}
In [15]:
z.store
Out[15]:
<zarr.storage.DirectoryStore at 0x112f33850>
In [16]:
!tree -a test.zarr | head
test.zarr
├── .zarray
├── .zattrs
├── 0.0
├── 0.1
├── 0.2
├── 0.3
├── 0.4
├── 0.5
├── 0.6
In [17]:
import json
with open("test.zarr/.zarray") as fp:
    display(json.load(fp))
{'chunks': [10, 10],
 'compressor': {'blocksize': 0,
  'clevel': 5,
  'cname': 'lz4',
  'id': 'blosc',
  'shuffle': 1},
 'dtype': '<f4',
 'fill_value': 0.0,
 'filters': None,
 'order': 'C',
 'shape': [60, 80],
 'zarr_format': 2}
In [18]:
import json
with open("test.zarr/.zattrs") as fp:
    display(json.load(fp))
{'units': 'degC'}
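
The chunk files listed by `tree` (`0.0`, `0.1`, ...) are named after their position in the chunk grid. As a rough sketch of that mapping, assuming the format-2 directory layout with `.` as separator shown above, the hypothetical helper `chunk_key` below finds which file holds a given element:

```python
def chunk_key(index, chunks, sep="."):
    """Return the key of the chunk containing `index` and the offset inside it."""
    grid = [i // c for i, c in zip(index, chunks)]
    offset = tuple(i % c for i, c in zip(index, chunks))
    return sep.join(str(g) for g in grid), offset

# Element z[10, 12] of an array with 10 x 10 chunks
key, offset = chunk_key((10, 12), chunks=(10, 10))
print(key, offset)  # 1.1 (0, 2)

# The last element, z[59, 79], sits in the last chunk of the 6 x 8 grid
last_key, last_offset = chunk_key((59, 79), chunks=(10, 10))
print(last_key, last_offset)  # 5.7 (9, 9)
```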

Chunking

Chunking is the main parameter that the user controls when creating or working with Zarr arrays, and the choice of chunks can have a large impact on performance. There are two main points to consider regarding chunking:

  1. Concurrent writes are safe as long as no two processes touch the same chunk.
  2. For data retrieval, the entire chunk is downloaded even if you need only a small piece of the data within it.
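
To make the first point concrete, here is a minimal sketch of how one might check whether two writers would collide. The helper `chunks_touched` is hypothetical (it is not part of Zarr) and assumes each region is given as non-empty `(start, stop)` pairs, one per axis.

```python
from itertools import product

def chunks_touched(region, chunks):
    """Set of chunk-grid indices overlapped by `region`.

    `region` is a tuple of (start, stop) pairs, one per axis, with stop > start.
    """
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(region, chunks)
    ]
    return set(product(*ranges))

chunks = (10, 10)
writer_a = ((0, 10), (0, 80))   # rows 0-9: exactly the first row of chunks
writer_b = ((10, 20), (0, 80))  # rows 10-19: exactly the second row of chunks
writer_c = ((5, 15), (0, 80))   # rows 5-14: straddles both rows of chunks

# Writers are safe to run concurrently when their chunk sets do not intersect.
safe_ab = chunks_touched(writer_a, chunks).isdisjoint(chunks_touched(writer_b, chunks))
safe_ac = chunks_touched(writer_a, chunks).isdisjoint(chunks_touched(writer_c, chunks))
print(safe_ab, safe_ac)  # True False
```

Aligning each writer's region with chunk boundaries, as `writer_a` and `writer_b` do, is the simplest way to keep writes independent.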

Let's compare a couple of chunking strategies:

In [19]:
a = zarr.create(shape=(100, 100, 100), chunks=(1, 100, 100), dtype='f8', store="a.zarr")
a[:] = np.random.randn(*a.shape)
In [20]:
%time _ = a[:, 0, 0]
CPU times: user 12 ms, sys: 6.54 ms, total: 18.5 ms
Wall time: 17.6 ms
In [21]:
b = zarr.create(shape=(100, 100, 100), chunks=(100, 100, 1), dtype='f8', store="b.zarr")
b[:] = np.random.randn(*b.shape)
In [22]:
%time _ = b[:, 0, 0]
CPU times: user 1.2 ms, sys: 997 µs, total: 2.2 ms
Wall time: 1.51 ms
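
The gap between the two timings follows from how many chunks the selection `[:, 0, 0]` has to fetch. The estimate below is a back-of-the-envelope sketch with a hypothetical `read_cost` helper; it counts uncompressed bytes and ignores compression and per-request overhead.

```python
import math

def read_cost(selection, chunks, itemsize):
    """Return (chunks fetched, uncompressed bytes decoded) for a selection.

    `selection` is a tuple of (start, stop) pairs, one per axis.
    """
    n_chunks = math.prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(selection, chunks)
    )
    return n_chunks, n_chunks * math.prod(chunks) * itemsize

selection = ((0, 100), (0, 1), (0, 1))  # equivalent to [:, 0, 0]
itemsize = 8                            # bytes per float64 value

cost_a = read_cost(selection, chunks=(1, 100, 100), itemsize=itemsize)
cost_b = read_cost(selection, chunks=(100, 100, 1), itemsize=itemsize)
print(cost_a)  # (100, 8000000)
print(cost_b)  # (1, 80000)
```

Both reads return the same 100 values (800 bytes), but layout `a` has to decode the entire 8 MB array to get them, while layout `b` decodes a single 80 KB chunk.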

Transforming from one chunking strategy to another is not a trivial problem. Refer to this super useful package for help with this particular problem.

Reading and writing to S3

In [23]:
import uuid
my_folder = f"s3://wxml/kushal_test/{uuid.uuid4().hex}"
my_folder
Out[23]:
's3://wxml/kushal_test/0ec1566eaa5c4331b0cff11d3d179f3d'

Now we create a store in the bucket and create a group in it

In [24]:
target = f"{my_folder}/test.zarr"
store = zarr.storage.FSStore(target)

group = zarr.group(store=store)

group.create(name="foo", shape=(100, 100), chunks=(10, 10), dtype='f4')
group.create(name="baz", shape=(100, 100), chunks=(20, 20), dtype='i4')
group
Out[24]:
<zarr.hierarchy.Group '/'>
In [25]:
group.foo[:] = np.random.rand(*group.foo.shape)
group.foo.info
Out[25]:
Name               : /foo
Type               : zarr.core.Array
Data type          : float32
Shape              : (100, 100)
Chunk shape        : (10, 10)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.FSStore
No. bytes          : 40000 (39.1K)
No. bytes stored   : 41843 (40.9K)
Storage ratio      : 1.0
Chunks initialized : 100/100
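
The storage ratio of 1.0 here, compared with 7.7 for the array filled with the constant 42, reflects how compressible the data is. The sketch below illustrates the effect under simplifying assumptions: it uses `zlib` from the standard library instead of Blosc, and uniformly random bytes instead of random floats, so the exact ratios will differ from Zarr's.

```python
import random
import zlib

n = 40_000  # same number of bytes as the uncompressed `foo` array

rng = random.Random(0)  # fixed seed so the example is reproducible
random_bytes = bytes(rng.getrandbits(8) for _ in range(n))
constant_bytes = bytes([42]) * n

ratio_random = n / len(zlib.compress(random_bytes))
ratio_constant = n / len(zlib.compress(constant_bytes))

print(f"random data:   {ratio_random:.2f}")
print(f"constant data: {ratio_constant:.2f}")
```

Random data has no repeated structure for the compressor to exploit, so its ratio stays close to 1, while constant data shrinks dramatically.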

Xarray + Zarr

In [26]:
import xarray as xr
import hvplot.xarray as hvx
 
ds = xr.tutorial.open_dataset("air_temperature")
ds
Out[26]:
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
In [27]:
ds.air.hvplot(x='lon', y='lat', cmap='magma')
Out[27]:

Writing Zarr from Xarray

In [28]:
ds_chunked = ds.chunk({'time': 100})
ds_chunked
Out[28]:
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 dask.array<chunksize=(100, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
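
The `chunksize=(100, 25, 53)` shown for `air` follows directly from the `{'time': 100}` argument. As a quick check of the arithmetic, written in plain Python so it does not depend on Xarray or Dask:

```python
import math

dims = {"time": 2920, "lat": 25, "lon": 53}
chunk_spec = {"time": 100}  # dimensions not listed stay in a single chunk
itemsize = 4                # bytes per float32 value

chunk_shape = tuple(min(chunk_spec.get(name, size), size) for name, size in dims.items())
n_chunks = math.prod(math.ceil(size / c) for size, c in zip(dims.values(), chunk_shape))
chunk_kib = math.prod(chunk_shape) * itemsize / 1024
last_time_chunk = dims["time"] % chunk_spec["time"]

print(chunk_shape)          # (100, 25, 53)
print(n_chunks)             # 30
print(round(chunk_kib, 2))  # 517.58
print(last_time_chunk)      # 20
```

Since 2920 is not a multiple of 100, the final chunk along `time` holds only 20 steps; the other 29 chunks are full.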
In [29]:
path = f"{my_folder}/air_temp.zarr"

from dask.diagnostics import ProgressBar
with ProgressBar():
    ds_chunked.to_zarr(path)
/Users/kushal/Downloads/Analysis_tools/mambaforge/envs/metenv/lib/python3.10/site-packages/xarray/core/dataset.py:2105: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
[########################################] | 100% Completed | 6.51 ss
In [30]:
path
Out[30]:
's3://wxml/kushal_test/0ec1566eaa5c4331b0cff11d3d179f3d/air_temp.zarr'
In [31]:
ds_from_s3 = xr.open_dataset(path, engine="zarr", chunks='auto')
ds_from_s3
Out[31]:
<xarray.Dataset>
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)
In [32]:
url = "s3://cmip6-pds/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/day/pr/gr1/v20180701/"
ds2 = xr.open_dataset(url, engine='zarr', backend_kwargs={'storage_options': {'anon': True}})
ds2
Out[32]:
<xarray.Dataset>
Dimensions:    (lat: 180, bnds: 2, lon: 288, time: 60225)
Coordinates:
  * lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
    lat_bnds   (lat, bnds) float64 ...
  * lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
    lon_bnds   (lon, bnds) float64 ...
  * time       (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
    time_bnds  (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 ...
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  36500.0
    comment:                <null ref>
    ...                     ...
    variable_id:            pr
    variant_info:           N/A
    variant_label:          r1i1p1f1
    status:                 2019-09-17;created;by nhn2@columbia.edu
    netcdf_tracking_ids:    hdl:21.14100/d4ce73dd-d8e0-44ef-847a-b957a138daf6...
    version_id:             v20180701

Questions?